Statistical computation and visualization (MATH-517)
Modern society has become increasingly dependent on the use of airplanes during the past decades. Despite the impressive engineering feat that aviation represents, airplanes and other aircraft sometimes fail, occasionally fatally. Understanding the causes of failure and considering what can be done to address them is of crucial interest for regulatory bodies and aviation authorities, but also for passengers. The dataset studied here is an extensive collection of incidents involving aircraft of different types across Brazil in the period 2006-2015. Various data have been collected for each incident. In what follows, we set out to investigate common features of these aircraft incidents.
The main aim is to understand the factors that contribute to aircraft accidents in Brazil over this 10-year period. We look for similarities between the accidents. Perhaps it would be possible to identify risk factors and thereby improve passenger safety.
Here are some more specific questions arising from the main problem that we will try to answer:
How are the accidents distributed on a map? Are there areas where accidents are more concentrated?
How do the accidents evolve through time? Is the occurrence rate constant? Are there periods with more accidents, e.g. during school vacations or in summer? Do they happen at a certain time of the day?
Does the age of the aircraft play a role? Are old planes more prone to accidents?
What about the characteristics of the aircraft? Do the model, the mass or the number of engines play a part in causing a crash?
What are the main occurrence types or causes of accidents?
What can we say about damage severity of the accidents? Are there correlations?
We employed linear, logistic and quantile regression to investigate the association between aircraft lifetime, accident severity and damage level.
This section is devoted to the study of the places where aircraft accidents occur. Indeed, it is essential to understand where they occur in order to identify more risk factors.
First, a comparison between the Brazilian states allows us to draw a first conclusion (Figure 1). Note that the counts are cumulative over the 10-year period.
Figure 1: Number of accidents from 2006 to 2015 by state in Brazil
All states but one have fewer than 200 accidents, and in fact fewer than 170 (the highest of these counts is in Rio Grande do Sul). The exception is São Paulo. This state is not larger than the others (it is rather medium in size), which points to a clear concentration of accidents in this area. Note that ten accidents are not shown on the map (Table 1). Eight of them occurred outside Brazil. The remaining two are omitted for other reasons: one ended up in international waters and the location of the other was not identified.
| Country | ARGENTINA | BRAZIL | COLOMBIA | ENGLAND | PARAGUAY | PERU | URUGUAY |
|---|---|---|---|---|---|---|---|
| Accidents | 1 | 2 | 1 | 1 | 2 | 1 | 2 |
One explanation for this observation is demographic. The state of São Paulo is the most populous in Brazil, with more than 41 million inhabitants in 2010 (Wikipedia, 2021a). Air traffic is therefore concentrated there, especially since the largest airports of the country are located in this state (for example São Paulo Guarulhos International Airport and São Paulo Congonhas Airport). With more flights, more accidents are to be expected.
For a finer geographical picture, we map each accident at its recorded location (the nearest city). The data do not include geographic coordinates, but since the city is available, coordinates can be obtained by geocoding. For this, we used two packages: ggmap and Nominatim. Each requires an API key (Google Maps for ggmap and MapQuest for Nominatim). Because the recorded locations are quite specific, geocoding often returns erroneous coordinates, which is why we queried both packages and kept the more reasonable result for each location. Some deviation between the city and the point shown on the map may remain, but it should not exceed one degree of longitude or latitude.
In Figure 2, accidents are grouped into clusters that break apart as the reader zooms in. For each accident, information such as the model, the manufacturer, the date and the time is available. Note that 11 more accidents have been removed compared with the previous map because their location could not be identified.
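The report does not say how the "most reasonable" coordinates were chosen. The following is a minimal sketch of one possible rule, not the authors' method: keep a candidate only if it falls inside a rough bounding box of Brazil, preferring the first geocoder. The bounding-box values, the function names `in_box` and `pick_coords`, and the candidate coordinates are all assumptions made for illustration.

```r
# Rough bounding box of mainland Brazil (approximate, assumed values).
brazil_box <- list(lon = c(-74.1, -34.7), lat = c(-33.8, 5.3))

# TRUE when a point is non-missing and lies inside the box.
in_box <- function(lon, lat, box = brazil_box) {
  !is.na(lon) & !is.na(lat) &
    lon >= box$lon[1] & lon <= box$lon[2] &
    lat >= box$lat[1] & lat <= box$lat[2]
}

# Hypothetical rule: use geocoder A when its point is plausible,
# fall back to geocoder B, otherwise return NA.
pick_coords <- function(lon_a, lat_a, lon_b, lat_b) {
  ok_a <- in_box(lon_a, lat_a)
  ok_b <- in_box(lon_b, lat_b)
  data.frame(
    lon = ifelse(ok_a, lon_a, ifelse(ok_b, lon_b, NA_real_)),
    lat = ifelse(ok_a, lat_a, ifelse(ok_b, lat_b, NA_real_))
  )
}

# Made-up candidates for three locations (not taken from the data).
res <- pick_coords(
  lon_a = c(-46.63, 2.35, NA),
  lat_a = c(-23.55, 48.86, NA),
  lon_b = c(-46.60, -43.17, 139.69),
  lat_b = c(-23.50, -22.91, 35.68)
)
print(res)
```

In this toy input the first location keeps geocoder A, the second falls back to geocoder B because A lies outside the box, and the third is left missing because neither candidate is plausible. A stricter rule could compare each candidate with the centroid of the recorded state.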
Figure 2: Number of accidents from 2006 to 2015 clustered by location in Brazil
We again observe a concentration of accidents in the Southeast Region of Brazil. This is explained by the fact that air traffic is densest there, as confirmed by Oliveira et al. (2020). Some cities stand out in terms of the number of accidents. Table 2 lists the ten cities with the most accidents.
| City | Rio De Janeiro | São Paulo | Goiânia | Brasília | Manaus | Belo Horizonte | Campo Grande | Londrina | Bragança Paulista | Porto Alegre |
|---|---|---|---|---|---|---|---|---|---|---|
| Accidents | 65 | 50 | 42 | 31 | 29 | 26 | 24 | 23 | 22 | 22 |
A demographic explanation is still reasonable. The first six cities in Table 2 are among the ten most populous cities in Brazil (Wikipedia, 2021b). Moreover, they also have airports on the list of the 20 busiest airports in Brazil (Wikipedia, 2021c).
Looking more closely at the accidents in these cities, we see that all kinds of aircraft are represented, whether they are airliners, tourist planes or even helicopters. To learn more in this direction, it is possible to explore the accidents on the map (Figure 2).
In this section, we consider whether there is any association between the age of an aircraft and the severity and damage level of the incident. To begin, we plot the cumulative distribution function of the aircraft's age in Figure 3 below.
Figure 3: Cumulative distribution function of the aircraft’s lifetime, denoted by \(X\)
We observe that more than 80% of the aircraft are between 0 and 40 years old at the time of the incident, with a small number of aircraft reaching nearly 80 years.
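As a small illustration of how such a statement can be read off an empirical cumulative distribution function, here is a sketch in base R. The ages are made up and are not the report's data.

```r
# Made-up aircraft ages in years (illustrative only).
age <- c(2, 5, 8, 12, 15, 21, 27, 33, 38, 45)

F_hat <- ecdf(age)     # step function: share of observations <= x
share_40 <- F_hat(40)  # share of aircraft aged 40 years or less
q80 <- quantile(age, 0.8, names = FALSE)  # age below which about 80% fall

print(share_40)
print(q80)

# Plot in the style of a cumulative distribution figure.
plot(F_hat, xlab = "Aircraft age (years)",
     ylab = "Cumulative proportion", main = "")
```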
Next, Figure 4 provides a scatter plot for visual inspection of the association between age and damage level.
Figure 4: Scatter plot of damage level (coded on a 0-3 scale) versus aircraft lifetime in recorded incidents
Likewise, Figure 5 shows severity against aircraft lifetime.
Figure 5: Scatter plot of severity versus aircraft lifetime in recorded incidents
To investigate the marginal association between age and damage level, we perform regression analysis using a linear model and a logistic model, in addition to quantile regression.
The fit summary for the linear regression of damage level on the lifetime of the aircraft is:
Call:
lm(formula = df.damage.age$damage_level ~ df.damage.age$lifetime)
Residuals:
Min 1Q Median 3Q Max
-1.8694 0.1309 0.1334 0.1356 1.1392
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.8693569 0.0341994 54.661 <2e-16 ***
df.damage.age$lifetime -0.0001235 0.0012004 -0.103 0.918
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7949 on 1884 degrees of freedom
Multiple R-squared: 5.621e-06, Adjusted R-squared: -0.0005252
F-statistic: 0.01059 on 1 and 1884 DF, p-value: 0.918
Here, we have coded the damage level on a scale of 0 to 3. Furthermore, the 95% confidence interval for each coefficient is given by
beta 2.5 % 97.5 %
(Intercept) 1.8693568526 1.802284164 1.936429541
df.damage.age$lifetime -0.0001235299 -0.002477756 0.002230697
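The interval for the slope is narrow and includes the null value of 0, so the data give no evidence of a linear association between lifetime and damage level.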
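The slope interval can be checked by hand from the rounded estimate, standard error and residual degrees of freedom printed in the linear-model summary. The sketch below uses the standard t-based formula, estimate ± t × SE. The variable names are illustrative.

```r
est <- -0.0001235   # slope estimate from the lm summary
se  <-  0.0012004   # its standard error
df  <- 1884         # residual degrees of freedom

t_crit <- qt(0.975, df)              # two-sided 95% critical value
ci <- est + c(-1, 1) * t_crit * se   # lower and upper bounds
print(ci)

# The interval contains 0 exactly when the bounds have opposite signs.
print(ci[1] < 0 && ci[2] > 0)
```

Up to rounding of the inputs, the bounds agree with the reported 2.5 % and 97.5 % limits for the lifetime coefficient.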
Likewise, the logistic regression of severity on aircraft lifetime (with the corresponding odds ratios and confidence intervals) is given below:
Call:
glm(formula = classification ~ lifetime, family = "binomial",
data = df.damage.age)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8589 -0.7911 -0.7661 1.5910 1.6827
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.13355 0.09886 -11.466 <2e-16 ***
lifetime 0.00413 0.00342 1.208 0.227
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2171.2 on 1885 degrees of freedom
Residual deviance: 2169.8 on 1884 degrees of freedom
AIC: 2173.8
Number of Fisher Scoring iterations: 4
(Intercept) lifetime
0.321890 1.004138
OR 2.5 % 97.5 %
(Intercept) 0.321890 0.2646635 0.390004
lifetime 1.004138 0.9974175 1.010886
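The odds ratios in the last table are the exponentiated logistic coefficients. The sketch below recomputes them from the rounded estimates printed above. It also forms a Wald interval for the lifetime odds ratio, which is an assumption on our part: the reported limits may have been obtained by a different method (for example profile likelihood), so only approximate agreement is expected.

```r
b0  <- -1.13355   # intercept (log-odds scale)
b1  <-  0.00413   # lifetime coefficient (log-odds per year)
se1 <-  0.00342   # standard error of the lifetime coefficient

# Odds ratios: exponentiate the coefficients.
or <- exp(c(intercept = b0, lifetime = b1))
print(or)

# Wald 95% interval for the lifetime odds ratio (assumed method).
wald <- exp(b1 + c(-1, 1) * qnorm(0.975) * se1)
print(wald)

# Implied odds ratio for a 10-year difference in lifetime.
or10 <- exp(10 * b1)
print(or10)
```

An odds ratio of about 1.004 per year corresponds to roughly a 4% change in odds per decade, with an interval that includes 1.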
Once again, the confidence interval for the odds ratio of lifetime includes the null value of 1. Finally, we consider the quantile regression coefficient from the quantile regression model \[ Q_Y(\tau\mid X) = a_0(\tau) + b_0(\tau)X \] where the outcome \(Y\) is aircraft lifetime and accident severity is the exposure \(X\). The resulting regression coefficient \(b_0(\tau)\) is plotted in Figure 6.
Figure 6: Quantile regression of damage level on the quantile of aircraft lifetime
The quantile plot shows that the conditional cumulative distribution function given severity level ‘serious incident’ is narrower than the one given severity level ‘accident.’ In other words, both high and low quantiles are shifted towards the median.
We did not find any strong marginal association between damage level and age of the aircraft in the population. This motivated us to examine whether such associations could exist within subsets of the population, such as the stratum of incidents involving helicopters. Once again, we perform a logistic regression, which yields the following fit summary and confidence intervals:
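When the exposure \(X\) is binary, the model \(Q_Y(\tau\mid X) = a_0(\tau) + b_0(\tau)X\) implies that \(b_0(\tau)\) is the difference between the two conditional quantiles, \(Q_Y(\tau\mid X=1) - Q_Y(\tau\mid X=0)\). The sketch below illustrates this with made-up lifetimes and plain sample quantiles in base R. It is not the fitting routine used for the figure, and the 0/1 coding of severity is an assumption.

```r
# Made-up lifetimes in years: group 1 is less spread out than group 0.
y0 <- seq(0, 60, length.out = 1001)   # X = 0, e.g. 'accident'
y1 <- seq(20, 40, length.out = 1001)  # X = 1, e.g. 'serious incident'

tau <- c(0.1, 0.5, 0.9)

# a0(tau): quantiles in the reference group.
a0 <- quantile(y0, tau, names = FALSE)
# b0(tau): shift of each quantile when moving from X = 0 to X = 1.
b0 <- quantile(y1, tau, names = FALSE) - a0

print(data.frame(tau, a0, b0))
```

A coefficient that is positive at low \(\tau\), near zero at the median and negative at high \(\tau\) is the signature of a narrower conditional distribution: both tails are pulled towards the median.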
Call:
glm(formula = classification ~ lifetime, family = "binomial",
data = df.helicopters)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8589 -0.7911 -0.7661 1.5910 1.6827
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.13355 0.09886 -11.466 <2e-16 ***
lifetime 0.00413 0.00342 1.208 0.227
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2171.2 on 1885 degrees of freedom
Residual deviance: 2169.8 on 1884 degrees of freedom
AIC: 2173.8
Number of Fisher Scoring iterations: 4
(Intercept) lifetime
0.321890 1.004138
OR 2.5 % 97.5 %
(Intercept) 0.321890 0.2646635 0.390004
lifetime 1.004138 0.9974175 1.010886
We do not find a strong association between accident severity and aircraft age within the stratum of helicopters either (the confidence interval for the lifetime odds ratio includes the null value of 1, and the p-value of the coefficient is 0.227).
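A stratified fit of this kind can be sketched as follows. The data are simulated, and the column name `type`, its levels and the data frame names are hypothetical, not those of the report. The sketch also checks that the model was fitted on the stratum only, by comparing the number of observations used with the size of the subset.

```r
set.seed(1)
n <- 400

# Simulated incidents (not the report's data).
df <- data.frame(
  type = sample(c("AIRPLANE", "HELICOPTER"), n, replace = TRUE),
  lifetime = runif(n, 0, 60),            # aircraft age in years
  classification = rbinom(n, 1, 0.25)    # 1 = more severe class (assumed coding)
)

# Restrict to the stratum of interest before fitting.
df.heli <- subset(df, type == "HELICOPTER")
fit <- glm(classification ~ lifetime, family = "binomial", data = df.heli)

# Sanity check: the fit uses only the helicopter rows.
print(c(stratum = nrow(df.heli), used = nobs(fit), all = nrow(df)))

# Odds ratios with Wald 95% intervals.
print(exp(cbind(OR = coef(fit), confint.default(fit))))
```

The residual degrees of freedom of a stratum-specific fit should equal the stratum size minus two, which is a quick way to confirm that the subset was applied.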
We employed linear, logistic and quantile regression to investigate the association between aircraft lifetime, accident severity and damage level, but did not find any statistically significant associations.